Section 3 - The five verbs and select in more detail

The five verbs and their meaning

The dplyr package contains five key data manipulation functions, also called verbs:

select(), which returns a subset of the columns,
filter(), that is able to return a subset of the rows,
arrange(), that reorders the rows according to single or multiple variables, mutate(), used to add columns from existing data,
summarize(), which reduces each group to a single row by calculating aggregate measures.

If you want to find out more about these functions, consult the documentation by clicking on the functions above.

Choosing is not losing! The select verb

To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select().

As an example, take the following call, that selects the variables var1 and var2 from the data frame df.

select(df, var1, var2)

You can also use : to select a range of variables and - to exclude some variables, similar to indexing a data.frame with square brackets. You can use both variable’s names as well as integer indexes. This call selects the four first variables except for the second one of a data frame df:

select(df, 1:4, -2)

select() does not change the data frame it is called on; you have to explicitly assign the result of select() to a variable to store the result.

# Load the hflights package
library(hflights)

hflights_df <- hflights[sample(nrow(hflights), 720), ] 
hflights <- as_tibble(hflights_df)

# Print out a tbl with the four columns of hflights related to delay
head(select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay))
## # A tibble: 6 x 4
##   ActualElapsedTime AirTime ArrDelay DepDelay
##               <int>   <int>    <int>    <int>
## 1               113      92      -12       -5
## 2               254     235       -7        2
## 3               169     148       73       84
## 4                51      32      -11      -13
## 5                80      65        3       15
## 6                56      41       -3        1
# Print out the columns Origin up to Cancelled of hflights
head(select(hflights, 14:19))
## # A tibble: 6 x 6
##   Origin Dest  Distance TaxiIn TaxiOut Cancelled
##   <chr>  <chr>    <int>  <int>   <int>     <int>
## 1 HOU    ATL        696      7      14         0
## 2 IAH    PDX       1825      5      14         0
## 3 IAH    SLC       1195     13       8         0
## 4 IAH    SHV        192      3      16         0
## 5 IAH    MEM        468      5      10         0
## 6 HOU    DAL        239      3      12         0
# Answer to last question: be concise!
head(select(hflights, 1:4, 12:21))
## # A tibble: 6 x 14
##    Year Month DayofMonth DayOfWeek ArrDelay DepDelay Origin Dest  Distance
##   <int> <int>      <int>     <int>    <int>    <int> <chr>  <chr>    <int>
## 1  2011     1          6         4      -12       -5 HOU    ATL        696
## 2  2011     6          2         4       -7        2 IAH    PDX       1825
## 3  2011     7         23         6       73       84 IAH    SLC       1195
## 4  2011     9          9         5      -11      -13 IAH    SHV        192
## 5  2011    12         28         3        3       15 IAH    MEM        468
## 6  2011     1         30         7       -3        1 HOU    DAL        239
## # ... with 5 more variables: TaxiIn <int>, TaxiOut <int>, Cancelled <int>,
## #   CancellationCode <chr>, Diverted <int>

Bear in mind that select() does not modify the original dataset. Instead, select() returns a new, modified copy. Therefore, choosing is not losing! You should save the result to a new variable if you want to be able to use it later.

Helper functions for variable selection

dplyr comes with a set of helper functions that can help you select groups of variables inside a select() call:

starts_with("X")  # every name that starts with "X",
ends_with("X")    # every name that ends with "X",
contains("X")     # every name that contains "X",
matches("X")      # every name that matches "X", where "X" can be a regular expression,
num_range("x", 1:5) # the variables named x01, x02, x03, x04 and x05,
one_of(x)         # every name that appears in x, which should be a character vector.

Pay attention here: When you refer to columns directly inside select(), you don’t use quotes. If you use the helper functions, you do use quotes.

# Print out a tbl containing just ArrDelay and DepDelay
head(select(hflights, ends_with(c('Delay'))))
## # A tibble: 6 x 2
##   ArrDelay DepDelay
##      <int>    <int>
## 1      -12       -5
## 2       -7        2
## 3       73       84
## 4      -11      -13
## 5        3       15
## 6       -3        1
# Print out a tbl as described in the second instruction, using both helper functions and variable names
head(select(hflights, UniqueCarrier, ends_with(c('Num')), starts_with(c('Cancel'))))
## # A tibble: 6 x 5
##   UniqueCarrier FlightNum TailNum Cancelled CancellationCode
##   <chr>             <int> <chr>       <int> <chr>           
## 1 DL                 1204 N964DL          0 ""              
## 2 CO                  209 N38403          0 ""              
## 3 OO                 4497 N813SK          0 ""              
## 4 XE                 4682 N21129          0 ""              
## 5 XE                 4364 N13929          0 ""              
## 6 WN                   44 N676SW          0 ""
# Print out a tbl as described in the third instruction, using only helper functions.
head(select(hflights, ends_with("Time"), ends_with("Delay")))
## # A tibble: 6 x 6
##   DepTime ArrTime ActualElapsedTime AirTime ArrDelay DepDelay
##     <int>   <int>             <int>   <int>    <int>    <int>
## 1    1244    1537               113      92      -12       -5
## 2    2137    2351               254     235       -7        2
## 3    1544    1733               169     148       73       84
## 4    2052    2143                51      32      -11      -13
## 5    1620    1740                80      65        3       15
## 6    1701    1757                56      41       -3        1

This exercise demonstrates the power of the helper functions. They can limit the amount of coding drastically. Beware of too obscure queries however; you still want your code to be clear for fellow data scientists!

Comparison to base R

To see the added value of the dplyr package, it is useful to compare its syntax with base R. Up to now, you have only considered functionality that is also available without the use of dplyr. The elegance and ease-of-use of dplyr is a great plus though.

# Finish select call so that ex1d matches ex1r
ex1r <- hflights[c("TaxiIn", "TaxiOut", "Distance")]
ex1d <- select(hflights, starts_with("Taxi"), Distance)

# Finish select call so that ex2d matches ex2r
ex2r <- hflights[c("Year", "Month", "DayOfWeek", "DepTime", "ArrTime")]
ex2d <- select(hflights, Year, Month, DayOfWeek, DepTime, ArrTime)

# Finish select call so that ex3d matches ex3r
ex3r <- hflights[c("TailNum", "TaxiIn", "TaxiOut")]
ex3d <- select(hflights, TailNum, starts_with("Taxi"))

Section 4 - The second of five verbs: mutate

Mutating is creating

mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.

Take this example that adds a new column, z, which is the element-wise sum of the columns x and y, to the data frame df:

mutate(df, z = x + y)

# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

# Add the new variable GroundTime to g1. Save the result as g2.
g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

# Add the new variable AverageSpeed to g2. Save the result as g3.
g3 <- mutate(g2, AverageSpeed = 60 * Distance / AirTime)

# Print out g3
g3 %>%
  head(10) %>% 
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = F, position = "center", , font_size = 11) %>%
  row_spec(0, bold = T, color = "white", background = "#3f7689")
Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum TailNum ActualElapsedTime AirTime ArrDelay DepDelay Origin Dest Distance TaxiIn TaxiOut Cancelled CancellationCode Diverted ActualGroundTime GroundTime AverageSpeed
2011 1 6 4 1244 1537 DL 1204 N964DL 113 92 -12 -5 HOU ATL 696 7 14 0 0 21 21 453.9130
2011 6 2 4 2137 2351 CO 209 N38403 254 235 -7 2 IAH PDX 1825 5 14 0 0 19 19 465.9574
2011 7 23 6 1544 1733 OO 4497 N813SK 169 148 73 84 IAH SLC 1195 13 8 0 0 21 21 484.4595
2011 9 9 5 2052 2143 XE 4682 N21129 51 32 -11 -13 IAH SHV 192 3 16 0 0 19 19 360.0000
2011 12 28 3 1620 1740 XE 4364 N13929 80 65 3 15 IAH MEM 468 5 10 0 0 15 15 432.0000
2011 1 30 7 1701 1757 WN 44 N676SW 56 41 -3 1 HOU DAL 239 3 12 0 0 15 15 349.7561
2011 5 8 7 1911 2139 CO 670 N14250 268 248 20 11 IAH SFO 1635 5 15 0 0 20 20 395.5645
2011 7 16 6 651 913 MQ 3859 N543MQ 142 123 -17 -4 IAH ORD 925 6 13 0 0 19 19 451.2195
2011 9 18 7 1942 2029 XE 4706 N14938 47 35 115 122 IAH LFT 201 4 8 0 0 12 12 344.5714
2011 5 14 6 1859 2223 XE 2438 N14174 144 119 11 -1 IAH CVG 871 7 18 0 0 25 25 439.1597

Add multiple variables using mutate

So far you’ve added variables to hflights one at a time, but you can also use mutate() to add multiple variables at once. To create more than one variable, place a comma between each variable that you define inside mutate().

mutate() even allows you to use a new variable while creating a next variable in the same call. In this example, the new variable x is directly reused to create the new variable y:

mutate(my_df, x = a + b, y = x + c)

# Add a second variable loss_ratio to the dataset: m1
m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_ratio = loss / DepDelay)

# Add the three variables as described in the third instruction: m2
m2 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, ActualGroundTime = ActualElapsedTime - AirTime, Diff = TotalTaxi - ActualGroundTime)

Session info

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 16299)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252   
## [3] LC_MONETARY=German_Switzerland.1252 LC_NUMERIC=C                       
## [5] LC_TIME=German_Switzerland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] hflights_0.1     ggplot2_3.1.0    dplyr_0.8.0.1    gapminder_0.3.0 
## [5] kableExtra_1.0.1 knitr_1.21      
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0        highr_0.7         plyr_1.8.4       
##  [4] pillar_1.3.1      compiler_3.5.2    prettydoc_0.2.1  
##  [7] tools_3.5.2       digest_0.6.18     gtable_0.2.0     
## [10] evaluate_0.12     tibble_2.0.1      viridisLite_0.3.0
## [13] pkgconfig_2.0.2   rlang_0.3.1       cli_1.0.1        
## [16] rstudioapi_0.9.0  yaml_2.2.0        xfun_0.4         
## [19] withr_2.1.2       httr_1.4.0        stringr_1.4.0    
## [22] xml2_1.2.0        hms_0.4.2         webshot_0.5.1    
## [25] grid_3.5.2        tidyselect_0.2.5  glue_1.3.0       
## [28] R6_2.4.0          fansi_0.4.0       rmarkdown_1.11   
## [31] readr_1.3.1       purrr_0.3.0       magrittr_1.5     
## [34] scales_1.0.0      htmltools_0.3.6   assertthat_0.2.0 
## [37] rvest_0.3.2       colorspace_1.4-0  utf8_1.1.4       
## [40] stringi_1.3.1     lazyeval_0.2.1    munsell_0.5.0    
## [43] crayon_1.3.4